SPEAKER'S VOICE RECOGNITION SYSTEM, METHOD AND RECORDING 
MEDIUM 

BACKGROUND OF THE INVENTION

The present invention relates to a speaker-independent
(indefinite speaker) voice recognition system and method,
as well as to an acoustic model learning method and a
recording medium with a voice recognition program recorded
therein and, more particularly, to a voice recognition
system capable of normalizing speakers on the frequency
axis, a learning system for normalization, a voice
recognition method, a learning method for normalization,
and a recording medium in which a program for voice
recognition and a learning program for normalization are
stored.

Spectrum converters in prior art voice recognition
systems are disclosed in, for instance, Japanese Patent
Laid-Open No. 6-214596 (referred to as Literature 1) and
Puming Zhan and Martin Westphal, "Speaker Normalization
Based on Frequency Warping", ICASSP, 1039-1042, 1997
(referred to as Literature 2).

For example, Literature 1 discloses a voice
recognition system which comprises a frequency
characteristic correcting means for correcting the
frequency characteristic of an input voice signal on the
basis of a plurality of predetermined different frequency
characteristic correction coefficients, a frequency
axis converting means for converting the frequency axis
of the input voice signal on the basis of a plurality
of predetermined frequency axis conversion coefficients,
a feature quantity extracting means for extracting the
feature quantity of the input voice signal as an input
voice feature quantity, a reference voice storing means
for storing a reference voice feature quantity, and a
collating means for collating the input voice feature
quantity, obtained as a result of the processes in the
frequency characteristic correcting means, the frequency
axis converting means and the feature quantity extracting
means, with the reference voice feature quantity stored
in the reference voice storing means, the voice
recognition system having a speaker adaptation phase
function and a voice recognition phase function. In the
voice recognition process in this system, in the speaker
adaptation phase an unknown speaker's voice signal having
a known content is processed in the frequency
characteristic correcting means, the frequency axis
converting means and the feature quantity extracting means
for each of the plurality of different frequency
characteristic correction coefficients and the
plurality of different frequency axis conversion
coefficients; the input voice feature quantity for each
coefficient and a reference voice feature quantity of
the same content as the above known content are collated
with each other; and the frequency characteristic
correction coefficient and the frequency axis conversion
coefficient giving a minimum distance are selected. In
the voice recognition phase, the input voice feature
quantity is determined by using the selected frequency
characteristic correction coefficient and frequency
axis conversion coefficient and is collated with the
reference voice feature quantity.

In these prior art voice recognition systems, to
improve the recognition performance, the spectrum
converter elongates or contracts the spectrum of the
voice signal on the frequency axis according to the sex,
age, physical condition, etc. of the individual speaker.
For spectrum elongation and contraction on the frequency
axis, a function whose outline can be varied with an
adequate parameter is defined and used for elongation or
contraction of the spectrum of the voice signal on the
frequency axis. The function which is used for elongating
or contracting the spectrum of the voice signal on the
frequency axis is referred to as the "warping function",
and the parameter defining the outline of the warping
function is referred to as the "elongation/contraction
parameter".

Heretofore, a plurality of values are prepared for the
elongation/contraction parameter of the warping function
("warping parameter"), the spectrum of the voice signal is
elongated or contracted on the frequency axis by using
each of these values, an input pattern is calculated by
using the elongated or contracted spectrum and is used
together with a reference pattern to obtain a distance,
and the value corresponding to the minimum distance is set
as the warping parameter value at the time of recognition.

The spectrum converter in the prior art voice
recognition system will now be described with reference
to the drawings. Fig. 9 is a view showing an example of
the construction of the spectrum converter in the prior
art voice recognition system. Referring to Fig. 9,
this prior art spectrum converter comprises an
FFT (Fast Fourier Transform) unit 301, an
elongation/contraction parameter memory 302, a
frequency converter 303, an input pattern calculating
unit 304, a matching unit 306, a reference pattern unit
305 and an elongation/contraction parameter selecting
unit 307. The FFT unit 301 cuts out the input voice
signal for every unit interval of time and applies a
Fourier transform to the cut-out signal to obtain a
frequency spectrum.

A plurality of elongation/contraction parameter
values for determining the elongation or contraction of
frequency are stored in the elongation/contraction
parameter memory 302. The frequency converter 303
executes a frequency elongation/contraction process on
the spectrum fed out from the FFT unit 301 using a warping
function with the outline thereof determined by the
elongation/contraction parameter, and feeds out the
spectrum obtained after the frequency
elongation/contraction process as an
elongation/contraction spectrum. The input pattern
calculating unit 304 calculates and outputs an input
pattern by using the elongation/contraction spectrum fed
out from the frequency converter 303. The input pattern
is, for instance, a parameter time series
representing an acoustical feature such as cepstrum.

The reference pattern is formed by using a large
number of input patterns and averaging phoneme unit input
patterns belonging to the same class by a certain type
of averaging means. For the preparation of the reference
pattern, see "Fundamentals of Voice Recognition", Part
I, translated and edited by Furui, NTT Advanced
Technology Co., Ltd., 1995, p. 63 (Literature 3).

Reference patterns can be classified by the
recognition algorithm. For example, time series
reference patterns with input patterns arranged in the
phoneme time series order are obtained in the case of
DP (Dynamic Programming) matching, and state sequences and
connection data thereof are obtained in the HMM (Hidden
Markov Model) case.

The matching unit 306 calculates a distance by using
the reference pattern 305 matched to the content of the
voice inputted to the FFT unit 301 and the input pattern.
The calculated distance corresponds to the likelihood
concerning the reference pattern in the HMM (Hidden
Markov Model) case and to the distance of the optimum
route in the DP matching case. The elongation/contraction
parameter selecting unit 307 selects the best matched
elongation/contraction parameter in view of the matching
property obtained in the matching unit 306.

Fig. 10 is a flow chart for describing a process
executed in a prior art spectrum matching unit. The
operation of the prior art spectrum matching unit will
now be described with reference to Figs. 9 and 10. The
FFT unit 301 executes the FFT operation on the voice signal
to obtain the spectrum thereof (step D101 in Fig. 10).
The frequency converter 303 executes elongation or
contraction of the spectrum on the frequency axis by using
the input elongation/contraction parameter (D106) (step
D102). The input pattern calculating unit 304
calculates the input pattern by using the spectrum
elongated or contracted on the frequency axis (step D103).
The matching unit 306 determines the distance between
the reference pattern (D107) and the input pattern (step
D104). The sequence of processes from step D101 to step
D104 is executed for all the elongation/contraction
parameter values stored in the elongation/contraction
parameter memory 302 (step D105).

When 10 elongation/contraction parameter values
are stored in the elongation/contraction parameter
memory 302, the process sequence from step D101 to D104
is repeated 10 times to obtain 10 different distances.
The elongation/contraction parameter selecting unit 307
compares the distances corresponding to all the
elongation/contraction parameters, and selects the
elongation/contraction parameter corresponding to the
shortest distance (step D108).
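By way of illustration only, the prior art selection loop of steps D102 to D108 may be sketched as follows. The sketch is not taken from Literature 1 or 2. The function names, the linear frequency scaling used as the warping function, and the use of the warped spectrum itself as the input pattern (instead of a cepstrum time series) are all simplifying assumptions.

```python
def warp_spectrum(spec, factor):
    """Hypothetical warping function: output bin k reads input position k*factor.

    Linear interpolation between neighbouring bins; positions past the last
    bin are clamped. This is an assumed stand-in for the frequency converter 303.
    """
    n = len(spec)
    out = []
    for k in range(n):
        pos = min(k * factor, n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append((1.0 - frac) * spec[lo] + frac * spec[hi])
    return out


def distance(pattern, reference):
    """Squared Euclidean distance, standing in for the matching unit 306."""
    return sum((x - y) ** 2 for x, y in zip(pattern, reference))


def select_parameter(spec, reference, candidates):
    """Try every stored parameter value and keep the one with the shortest distance."""
    best_value, best_dist = None, None
    for value in candidates:                  # step D105: repeat for each stored value
        pattern = warp_spectrum(spec, value)  # steps D102 and D103 (simplified)
        dist = distance(pattern, reference)   # step D104
        if best_dist is None or dist < best_dist:
            best_value, best_dist = value, dist
    return best_value                         # step D108
```

The cost grows linearly with the number of stored values, and the result can never be better than the best stored value; these are the two points taken up in the problems described next.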

However, the above prior art spectrum converter has
the following problems.

The first problem is that increased computational
effort is required in the elongation/contraction
parameter value determination. This is so because in the
prior art spectrum converter it is necessary to prepare
a plurality of elongation/contraction parameter values
and to execute the FFT process, the spectrum frequency
elongation/contraction process and the input pattern
calculation repeatedly, a number of times corresponding
to the number of these values.

The second problem is that a sufficient effect of the
frequency elongation and contraction on the voice
recognition system may not be obtained. This is so because
the elongation/contraction parameter values are all
predetermined, and none of these values may be optimum
for an unknown speaker.

SUMMARY OF THE INVENTION

The present invention was made in view of the above
problems, and its main object is to provide a voice
recognition system and method, and also a recording medium,
which permit calculation of the optimum
elongation/contraction parameter value for each speaker
with less computational effort and can thus improve
performance. The above and other objects and features
of the present invention will become apparent from the
following description.

According to a first aspect of the present
invention, there is provided a voice recognition system
comprising a spectrum converter for elongating or
contracting the spectrum of a voice signal on the
frequency axis, the spectrum converter including: an
analyzer for converting an input voice signal to an input
pattern including cepstrum; a reference pattern memory
with reference patterns stored therein; an
elongation/contraction estimating unit for outputting
an elongation/contraction parameter in the frequency
axis direction by using the input pattern and the
reference patterns; and a converter for converting the
input pattern by using the elongation/contraction
parameter.

According to a second aspect of the present
invention, there is provided a voice recognition system
comprising: an analyzer for converting an input voice
signal to an input pattern including a cepstrum; a
reference pattern memory for storing reference patterns;
an elongation/contraction estimating unit for
outputting an elongation/contraction parameter in the
frequency axis direction by using the input pattern and
reference patterns; a converter for converting the input
pattern by using the elongation/contraction parameter;
and a matching unit for computing the distances between
the elongated or contracted input pattern fed out from
the converter and the reference patterns and outputting
the reference pattern corresponding to the shortest
distance as the result of recognition.

The converter executes the elongation or
contraction of the spectrum on the frequency axis, with
the warping function defining the form of elongation or
contraction, by carrying out the elongation or contraction
in cepstrum space. The elongation/contraction estimating
unit determines the elongation or contraction of the
spectrum on the frequency axis, with the warping function
defining the form of elongation or contraction, by using
an estimate derived from the maximum likelihood estimation
of the HMM (Hidden Markov Model) in cepstrum space.

According to a third aspect of the present
invention, there is provided a reference pattern
learning system comprising: a learning voice memory with
learning voice data stored therein; an analyzer for
receiving a learning voice signal from the learning voice
memory and converting the learning voice signal to an
input pattern including cepstrum; a reference pattern
memory with reference patterns stored therein; an
elongation/contraction estimating unit for outputting
an elongation/contraction parameter in the frequency axis
direction by using the input pattern and the reference
patterns; a converter for converting the input pattern
by using the elongation/contraction parameter; a reference
pattern estimating unit for updating the reference
patterns stored in the reference pattern memory for the
learning voice data by using the elongated or contracted
input pattern fed out from the converter and the reference
patterns; and a likelihood judging unit for monitoring
distance changes by computing distances by using the
elongated or contracted input pattern fed out from the
converter and the reference patterns.

The converter executes the elongation or
contraction of the spectrum on the frequency axis, with
the warping function defining the form of elongation or
contraction, by carrying out the elongation or contraction
in cepstrum space. The elongation/contraction estimating
unit determines the elongation or contraction of the
spectrum on the frequency axis, with the warping function
defining the form of elongation or contraction, by using
an estimate derived from the maximum likelihood estimation
of the HMM (Hidden Markov Model) in cepstrum space.

According to a fourth aspect of the present
invention, there is provided a voice quality converting
system comprising: an analyzer for converting an input
voice signal to an input pattern including a cepstrum;
a reference pattern memory for storing reference
patterns; an elongation/contraction estimating unit for
outputting an elongation/contraction parameter in the
frequency axis direction by using the input pattern and
reference patterns; a converter for converting the input
pattern by using the elongation/contraction parameter;
and an inverse converter for outputting a signal waveform
in the time domain by inversely converting the time series
input pattern obtained after the elongation/contraction
supplied from the converter.

According to a fifth aspect of the present
invention, there is provided a recording medium for a
computer constituting a spectrum converter by executing
elongation or contraction of the spectrum of a voice 
signal on frequency axis, in which is stored a program 
for executing the following processes: (a) an analyzing 
process for converting an input voice signal to an input 
pattern including cepstrum; (b) an 
elongation/contraction estimating process for 
outputting an elongation/contraction parameter in 
frequency axis direction by using the input pattern and 
reference patterns stored in a reference pattern memory; 
and (c) a converting process for converting the input 
pattern by using the elongation/contraction parameter. 

According to a sixth aspect of the present 
invention, there is provided a recording medium for a 
computer constituting a system for voice recognition by 
executing elongation or contraction of the spectrum of 
a voice signal on frequency axis, in which is stored a 
program for executing the following processes: (a) an 
analyzing process for converting an input voice signal 
to an input pattern including cepstrum; (b) an 
elongation/contraction estimating process for 
outputting an elongation/contraction parameter in 
frequency axis direction by using the input pattern and 
reference patterns stored in a reference pattern memory; 
(c) a converting process for converting the input pattern 
by using the elongation/contraction parameter; and (d) 
a matching process for computing the distances between 
the elongated or contracted input pattern and the
reference patterns and outputting the reference pattern
corresponding to the shortest distance as result of 
recognition. 

The converting process executes the elongation or
contraction of the spectrum on the frequency axis, with
the warping function defining the form of elongation or
contraction, by carrying out the elongation or contraction
in cepstrum space. The elongation/contraction estimating
process determines the elongation or contraction of the
spectrum on the frequency axis, with the warping function
defining the form of elongation or contraction, by using
an estimate derived from the maximum likelihood estimation
of the HMM (Hidden Markov Model) in cepstrum space.

According to a seventh aspect of the present
invention, there is provided, in a computer constituting
a system for learning reference patterns from learning
voice data, a recording medium in which is stored a
program for executing the following processes: (a) an
analyzing process for receiving learning voice data from
a learning voice memory with learning voice data stored
therein and converting the received learning voice data
to an input pattern including cepstrum; (b) an
elongation/contraction estimating process for
outputting an elongation/contraction parameter in the
frequency axis direction by using the input pattern and
the reference patterns stored in the reference pattern
memory; (c) a converting process for converting the input
pattern by using the elongation/contraction parameter;
(d) a reference pattern estimating process for updating
the reference patterns for the learning voice data by
using the elongated or contracted pattern fed out in the
converting process and the reference patterns; and (e)
a likelihood judging process for calculating the
distances between the elongated or contracted input
pattern after conversion in the converting process and
the reference patterns and monitoring changes in
distance.

The converting process executes the elongation or
contraction of the spectrum on the frequency axis, with
the warping function defining the form of elongation or
contraction, by carrying out the elongation or contraction
in cepstrum space. The elongation/contraction estimating
process determines the elongation or contraction of the
spectrum on the frequency axis, with the warping function
defining the form of elongation or contraction, by using
an estimate derived from the maximum likelihood estimation
of the HMM (Hidden Markov Model) in cepstrum space.

According to an eighth aspect of the present
invention, there is provided a recording medium for a
computer constituting a spectrum converter by executing
elongation or contraction of the spectrum of a voice
signal on the frequency axis, in which is stored a program
for executing the following processes: (a) an analyzing
process for converting an input voice signal to an input
pattern including cepstrum; (b) an
elongation/contraction estimating process for
outputting an elongation/contraction parameter in the
frequency axis direction by using the input pattern and
reference patterns stored in a reference pattern memory;
(c) a converting process for converting the input pattern
by using the elongation/contraction parameter; and (d)
an inverse converting process for outputting a signal
waveform in the time domain by inversely converting the
time series input pattern obtained after the
elongation/contraction in the converting process.

According to a ninth aspect of the present
invention, there is provided a spectrum converting
method for elongating or contracting the spectrum of a
voice signal on the frequency axis, comprising: a first
step for converting an input voice signal to an input
pattern including cepstrum; a second step for outputting
an elongation/contraction parameter in the frequency
axis direction by using the input pattern and the
reference patterns stored in a reference pattern memory;
and a third step for converting the input pattern by using
the elongation/contraction parameter.

According to a tenth aspect of the present
invention, there is provided a voice recognition method
comprising: a first step for converting an input voice
signal to an input pattern including a cepstrum; a second
step for outputting an elongation/contraction parameter
in the frequency axis direction by using the input pattern
and reference patterns stored in a reference pattern
memory; a third step for converting the input pattern
by using the elongation/contraction parameter; and a
fourth step for computing the distances between the
elongated or contracted input pattern and the reference
patterns and outputting the reference pattern
corresponding to the shortest distance as the result of
recognition.

The elongation or contraction of the spectrum on the
frequency axis, with the warping function defining the
form of elongation or contraction, is executed by carrying
out the elongation or contraction in cepstrum space. The
elongation/contraction estimating process determines the
elongation or contraction of the spectrum on the frequency
axis, with the warping function defining the form of
elongation or contraction, by using an estimate derived
from the maximum likelihood estimation of the HMM (Hidden
Markov Model) in cepstrum space.

According to an eleventh aspect of the present
invention, there is provided a reference pattern
learning method comprising: a first step for receiving
a learning voice signal from a learning voice memory
and converting the learning voice signal to an input
pattern including cepstrum; a second step for outputting
an elongation/contraction parameter in the frequency axis
direction by using the input pattern and the reference
patterns stored in a reference pattern memory; a third
step for converting the input pattern by using the
elongation/contraction parameter; a fourth step for
updating the reference patterns for the learning voice
data by using the elongated or contracted input pattern
and the reference patterns; and a fifth step for
monitoring distance changes by computing distances by
using the elongated or contracted input pattern and the
reference patterns.

The third step executes the elongation or
contraction of the spectrum on the frequency axis, with
the warping function defining the form of elongation or
contraction, by carrying out the elongation or contraction
in cepstrum space. The second step determines the
elongation or contraction of the spectrum on the frequency
axis, with the warping function defining the form of
elongation or contraction, by using an estimate derived
from the maximum likelihood estimation of the HMM (Hidden
Markov Model) in cepstrum space.

According to a twelfth aspect of the present
invention, there is provided a voice recognition method
of spectrum conversion to convert the spectrum of a voice
signal by executing elongation or contraction of the
spectrum on the frequency axis, wherein: the spectrum
elongation or contraction of the input voice signal as
defined by a warping function is executed on the cepstrum,
the extent of elongation or contraction of the spectrum
on the frequency axis is determined by the
elongation/contraction parameter included in the warping
function, and an optimum value is determined as the
elongation/contraction parameter value for each
speaker.

Other objects and features will be clarified from
the following description with reference to the attached
drawings.

BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a view showing the construction of a
spectrum converter in a first embodiment of the voice
recognition system according to the present invention;

Fig. 2 is a flow chart for explaining the process
in the first embodiment of the present invention;

Fig. 3 is a view showing the construction of the
second embodiment of the present invention;

Fig. 4 is a flow chart for describing the process
sequence in the second embodiment of the present
invention;

Fig. 5 is a view showing the construction of the
third embodiment of the present invention;

Fig. 6 is a flow chart for describing the process
in the third embodiment of the present invention;

Fig. 7 is a view showing the construction of the
fourth embodiment of the present invention;

Fig. 8 is a view showing the construction of the
fifth embodiment of the present invention;

Fig. 9 is a view showing an example of the
construction of the spectrum converter in the prior art
voice recognition system; and

Fig. 10 is a flow chart for describing a process
executed in a prior art spectrum matching unit.

PREFERRED EMBODIMENTS OF THE INVENTION

Embodiments of the present invention will now be
described in detail with reference to the drawings.

A system according to the present invention
generally comprises an analyzer unit 1 for converting
an input voice signal to an input pattern containing
cepstrum, an elongation/contraction estimating unit 3
for outputting an elongation/contraction parameter in
the frequency axis direction by using the input pattern
and a reference pattern, and a converter unit 2 for
converting the input pattern by using the
elongation/contraction parameter.

The system further comprises a matching unit (i.e.,
recognizing unit 101) for calculating the distance
between the input pattern converted by the converter 2
and reference patterns and outputting the reference
pattern corresponding to the shortest distance as the
result of recognition.

The elongation/contraction estimating unit 3
estimates an elongation/contraction parameter by using
a cepstrum contained in the input pattern. Thus,
according to the present invention it is not necessary
to store various values in advance when determining the
elongation/contraction parameter. Nor is it
necessary to execute distance calculation in connection
with various values.

Furthermore, the system according to the present
invention comprises a learning voice memory 201 for
storing learning voices, an analyzer 1 for receiving the
learning voice data from the learning voice memory 201
and converting the received data to an input pattern
including cepstrum, a reference pattern memory 4 for
storing reference patterns, an elongation/contraction
estimating unit 3 for outputting an
elongation/contraction parameter in the frequency axis
direction by using the input pattern and the reference
patterns, a converter 2 for converting the input pattern
by using the elongation/contraction parameter, a
reference pattern estimating unit 202 for
updating the reference patterns for the learning voice
by utilizing the input pattern after elongation or
contraction fed out from the converter and the reference
patterns, and a likelihood judging unit 203 for computing
the distance by utilizing the input pattern after
elongation or contraction and the reference patterns and
monitoring changes in the distance.

Fig. 1 is a view showing the construction of a
spectrum converter in a first embodiment of the voice
recognition system according to the present invention.
Referring to Fig. 1, the spectrum converter in the first
embodiment of the voice recognition system comprises an
analyzer 1, a converter 2, an elongation/contraction
estimating unit 3 and a reference pattern memory 4.

The analyzer 1 cuts out a voice signal for every
predetermined interval of time, obtains the spectrum
component of the cut-out signal by using FFT (Fast Fourier
Transform) or LPC (Linear Predictive Coding) analysis,
converts the spectrum to the mel scale, which takes human
auditory perception into account, obtains a melcepstrum
for extracting the envelope component of the resulting
melspectrum, and feeds out the melcepstrum, the change
therein, the change in the change, etc. as the input
pattern. The converter 2 executes elongation or
contraction of frequency by converting the melcepstrum in
the input pattern. An example of the conversion executed
in the converter 2 will now be described in detail.

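According to Oppenheim, "Discrete Representation of
Signals", Proc. IEEE, 60, 681-691, June 1972 (Literature
4), the frequency conversion with a first-order all-pass
filter as represented by Formula (1) given below can be
expressed by Formula (2) as a recursive expression using
the cepstrum (where c denotes the cepstrum and its
subscript the cepstral dimension).

\[
\hat{z}^{-1} = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}} \tag{1}
\]

\[
\hat{c}_m^{(i)} =
\begin{cases}
c_{-i} + \alpha\,\hat{c}_0^{(i-1)}, & m = 0 \\
(1 - \alpha^2)\,\hat{c}_0^{(i-1)} + \alpha\,\hat{c}_1^{(i-1)}, & m = 1 \\
\hat{c}_{m-1}^{(i-1)} + \alpha\left(\hat{c}_m^{(i-1)} - \hat{c}_{m-1}^{(i)}\right), & m \ge 2
\end{cases}
\qquad i = -\infty, \ldots, -1, 0 \tag{2}
\]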

The conversion in the cepstrum space given by
Formula (2) is equivalent to the frequency conversion of
the spectrum given by Formula (1). Accordingly, the
converter 2 executes elongation or contraction of the
spectrum frequency without direct use of the spectrum, by
executing the conversion given by Formula (2), derived
from Formula (1), on the input pattern, with Formula (1)
as the warping function and with α in Formula (1) as the
elongation/contraction parameter. The input pattern
obtained after the conversion is fed out as the converted
input pattern.
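By way of illustration only, the recursion of Formula (2) may be sketched as follows for a cepstrum of finite order. The function name warp_cepstrum and the optional out_order argument are assumptions introduced for the sketch; the recursion is started from zero and the cepstrum coefficients are consumed from the highest order down to c_0, which corresponds to i running up to 0.

```python
def warp_cepstrum(c, alpha, out_order=None):
    """Frequency-warp a cepstrum c[0..M] with the recursion of Formula (2).

    c         : list of cepstrum coefficients c_0 ... c_M
    alpha     : elongation/contraction parameter of the all-pass warping function
    out_order : order of the warped cepstrum (assumed default: same as the input)
    """
    n_out = len(c) - 1 if out_order is None else out_order
    g = [0.0] * (n_out + 1)              # warped cepstrum, updated once per step i
    # Formula (2) uses c_{-i} at step i, so c[M] is consumed first and c[0] last.
    for k in range(len(c) - 1, -1, -1):
        prev = g[:]                      # values of the previous step (i - 1)
        g[0] = c[k] + alpha * prev[0]                                 # m = 0
        if n_out >= 1:
            g[1] = (1.0 - alpha * alpha) * prev[0] + alpha * prev[1]  # m = 1
        for m in range(2, n_out + 1):                                 # m >= 2
            g[m] = prev[m - 1] + alpha * (prev[m] - g[m - 1])
    return g
```

With alpha equal to zero the recursion returns the input unchanged, which is a convenient sanity check. No spectrum is computed at any point; only the cepstrum coefficients of the input pattern are touched.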

Reference patterns are stored in the reference
pattern memory 4. The reference patterns may be
hidden Markov models (HMMs) or time series reference
patterns, such as phoneme time series, serving
as phonetic data in units of words or phonemes. In this
embodiment, the reference patterns are HMMs. The data
constituting an HMM may be the mean vectors and variances
of continuous Gaussian distributions, inter-state
transition probabilities, etc.

The elongation/contraction estimating unit (also
referred to as elongation/contraction parameter
estimating unit) 3 obtains the alignment of the input
pattern by using the HMM corresponding to the voice signal
inputted to the analyzer 1. The term "alignment" means
the posterior probability at each instant and in each
state of the HMM.

The alignment may be obtained by using a well-known
method such as the Viterbi algorithm or the
forward/backward algorithm described in "Fundamentals
of Voice Recognition (Part II)", translated and edited
by Furui, NTT Advanced Technology Co., Ltd., 1995, pp.
102-185 (Literature 5).

The elongation/contraction parameter is calculated by using
the obtained alignment, the HMM and the input pattern. The
calculation is based on Formula (4).



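\[
\begin{aligned}
\hat{c}_0 &= \sum_{m=0}^{\infty} \alpha^{m} c_m \\
\hat{c}_1 &= (1-\alpha^2) \sum_{m=1}^{\infty} m\,\alpha^{m-1} c_m \\
\hat{c}_2 &= c_2 + \alpha(-c_1 + 3c_3) + \alpha^2(-4c_2 + 6c_4) + \alpha^3(c_1 - 9c_3 + 10c_5) + \cdots \\
\hat{c}_3 &= c_3 + \alpha(-2c_2 + 4c_4) + \alpha^2(c_1 - 9c_3 + 10c_5) + \alpha^3(6c_2 - 24c_4 + 20c_6) + \cdots
\end{aligned}
\tag{3}
\]

\[
\hat{c}_m = c_m + \alpha\left\{(m+1)\,c_{m+1} - (m-1)\,c_{m-1}\right\}, \qquad m > 0 \tag{4}
\]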
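
A minimal sketch comparing the first degree approximation of
Formula (4) with the exact recursion of Formula (2) is given
below. The function names, the finite cepstrum order and the
treatment of coefficients beyond the last index as zero are
assumptions of the sketch, not part of the specification.

```python
def warp_exact(c, alpha):
    """Recursion of Formula (2), truncated to the length of c (assumption)."""
    n = len(c)
    g = [0.0] * n
    for i in range(-(n - 1), 1):
        d = g[:]                                  # previous step c-hat^(i-1)
        g[0] = c[-i] + alpha * d[0]
        if n > 1:
            g[1] = (1.0 - alpha * alpha) * d[0] + alpha * d[1]
        for m in range(2, n):
            g[m] = d[m - 1] + alpha * (d[m] - g[m - 1])
    return g


def warp_first_order(c, alpha):
    """Formula (4): c-hat_m = c_m + alpha*((m+1)c_{m+1} - (m-1)c_{m-1}), m > 0.

    c_0 is returned unchanged because Formula (4) is stated for m > 0 only.
    """
    n = len(c)
    out = list(c)
    for m in range(1, n):
        nxt = c[m + 1] if m + 1 < n else 0.0      # assumed zero past the end
        out[m] = c[m] + alpha * ((m + 1) * nxt - (m - 1) * c[m - 1])
    return out


def max_error(c, alpha):
    """Largest deviation between the two conversions over m > 0."""
    exact = warp_exact(c, alpha)
    approx = warp_first_order(c, alpha)
    return max(abs(e - a) for e, a in zip(exact[1:], approx[1:]))
```

In this small example, reducing α from 0.1 to 0.01 for the
input [1.0, 2.0, 3.0] reduces max_error by roughly two orders
of magnitude, as would be expected when the neglected terms
start at the second degree of α.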



The estimation formula is derived by expanding the
recursive equation of Formula (2) with respect to the
elongation/contraction parameter as in Formula (3),
approximating the result of the expansion by its first
degree term in α (Formula (4)), introducing the result in
the Q function of HMM for likelihood estimation as described
in Literature 4, and maximizing the Q function.

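The function thus derived is given by Formula (5).

\[
\alpha =
\frac{\displaystyle \sum_{t}\sum_{j}\sum_{k} \gamma_t(j,k) \sum_{m} \frac{1}{\sigma_{jkm}^2}\,\Delta c_{mt}\,\bar{c}_{mt}}
     {\displaystyle \sum_{t}\sum_{j}\sum_{k} \gamma_t(j,k) \sum_{m} \frac{1}{\sigma_{jkm}^2}\,\bar{c}_{mt}^{\,2}},
\qquad
\begin{aligned}
\Delta c_{mt} &= c_{mt} - \mu_{jkm}, \\
\bar{c}_{mt} &= (m-1)\,c_{(m-1)t} - (m+1)\,c_{(m+1)t}
\end{aligned}
\tag{5}
\]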



In Formula (5), c represents the mel-cepstrum part of the
above input pattern, μ represents the average vector of the
HMM, σ² represents the variance of the HMM, and γ represents
the post-probability at instant t, in state j and in mixture
k, as alignment data.
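
The maximization step that leads from Formula (4) to Formula
(5) can be written out as follows. This is a sketch under
assumptions not stated in the text: each mixture component is
a Gaussian with diagonal covariance, σ²_{jkm} denotes its
variance for the m-th cepstrum coefficient, any Jacobian term
of the conversion is neglected, and the terms of the Q
function that do not depend on α are collected in a constant.

```latex
\begin{aligned}
\bar{c}_{mt} &= (m-1)\,c_{(m-1)t} - (m+1)\,c_{(m+1)t},
\qquad \hat{c}_{mt} = c_{mt} - \alpha\,\bar{c}_{mt}
\quad \text{(Formula (4))} \\
Q(\alpha) &= -\frac{1}{2} \sum_{t}\sum_{j}\sum_{k} \gamma_t(j,k)
  \sum_{m} \frac{\left(\hat{c}_{mt} - \mu_{jkm}\right)^2}{\sigma_{jkm}^2}
  + \text{const} \\
\frac{\partial Q}{\partial \alpha} &=
  \sum_{t}\sum_{j}\sum_{k} \gamma_t(j,k)
  \sum_{m} \frac{\left(\Delta c_{mt} - \alpha\,\bar{c}_{mt}\right)\bar{c}_{mt}}{\sigma_{jkm}^2} = 0,
\qquad \Delta c_{mt} = c_{mt} - \mu_{jkm} \\
\alpha &=
  \frac{\sum_{t}\sum_{j}\sum_{k} \gamma_t(j,k) \sum_{m} \Delta c_{mt}\,\bar{c}_{mt} / \sigma_{jkm}^2}
       {\sum_{t}\sum_{j}\sum_{k} \gamma_t(j,k) \sum_{m} \bar{c}_{mt}^{\,2} / \sigma_{jkm}^2}
\end{aligned}
```

Because Q is quadratic in α under the first degree
approximation, the stationary point is obtained in closed
form and no search over candidate values of α is needed.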

The post-probability is the presence probability at a
certain instant and in a certain state in the case of the
forward/backward algorithm. In the case of the Viterbi
algorithm, it is "1" when the state at a certain instant
lies on the optimum route and "0" otherwise.
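
The estimation of Formula (5) from the alignment can be
sketched as follows, under assumptions not stated in the
specification: states and mixtures are flattened into a single
index s, each component is a diagonal-covariance Gaussian,
coefficients beyond the last cepstrum index are taken as zero,
and the function name estimate_alpha is hypothetical. The
array gamma holds the post-probabilities, which are fractional
weights in the forward/backward case and 0 or 1 in the Viterbi
case.

```python
def estimate_alpha(cep, mean, var, gamma):
    """Closed-form estimate of the warping parameter (Formula (5)).

    cep[t][m]   : mel-cepstrum coefficient m at instant t
    mean[s][m]  : average vector of component s (state/mixture flattened)
    var[s][m]   : variance of component s
    gamma[t][s] : post-probability of component s at instant t
    All names and the flattened index s are assumptions of this sketch.
    """
    num = 0.0
    den = 0.0
    for t, c in enumerate(cep):
        n = len(c)
        for s, g in enumerate(gamma[t]):
            if g == 0.0:
                continue                      # off the alignment: no contribution
            for m in range(1, n):             # Formula (4) holds for m > 0
                nxt = c[m + 1] if m + 1 < n else 0.0
                c_bar = (m - 1) * c[m - 1] - (m + 1) * nxt
                delta = c[m] - mean[s][m]
                num += g * delta * c_bar / var[s][m]
                den += g * c_bar * c_bar / var[s][m]
    # With an empty alignment there is nothing to estimate; fall back to 0.
    return num / den if den > 0.0 else 0.0
```

With a Viterbi alignment, each row of gamma contains a single
1, so only the component on the optimum route contributes at
each instant. With a forward/backward alignment, every
component contributes in proportion to its post-probability.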

While Formula (1) was given as the warping function in this
embodiment, it is by no means limitative, and according to
the present invention it is possible to adopt any warping
function. Also, while the first degree approximation of
Formula (2) was used to derive Formula (5), it is also
possible to use second and higher degree approximations.

Fig. 2 is a flow chart for explaining the process in the
first embodiment of the present invention. The overall
operation of the first embodiment will now be described in
detail with reference to Figs. 1 and 2. Subsequent to the
input of a voice signal (step A101 in Fig. 2), the analyzer
1 calculates the input pattern (step A102). Then, the
elongation/contraction estimating unit 3 calculates the
elongation/contraction parameter by using the input pattern
fed out from the analyzer 1 and the HMM inputted in step
A105 (step A103). Then, the converter 2 obtains the
converted input pattern from the input pattern fed out from
the analyzer 1 by using the conversion function of any one
of Formulas (2) to (4) (step A104). The value of α is "0"
in the case of the first utterance, while the values fed
out from the elongation/contraction estimating unit 3 are
used as α in the cases of the second and following
utterances.

The first embodiment of the present invention has the
following effects. In the first embodiment, the input
pattern fed out from the analyzer 1 is inputted to the
converter 2, and the spectrum frequency elongation and
contraction can be executed in the mel-cepstrum domain.
Where Formula (5) is used, the repeated calculation
described before in connection with the prior art is
unnecessary, and analysis and other processes need be
executed only once. It is thus possible to reduce the
computational effort for the elongation/contraction
parameter estimation.

A second embodiment of the present invention will now be
described. Fig. 3 is a view showing the construction of the
second embodiment of the present invention. The second
embodiment of the voice recognition system comprises an
analyzer 1, a converter 2, an elongation/contraction
estimating unit 3, a recognizing unit 101 and a reference
pattern memory 4. The analyzer 1, the converter 2, the
elongation/contraction estimating unit 3 and the reference
pattern memory 4 are the same as those described before in
the description of the first embodiment. Specifically, like
the first embodiment, the analyzer 1 analyzes the voice
signal, and then calculates and feeds out the input pattern.
Also like the first embodiment, the converter 2 converts the
input pattern, and feeds out the converted input pattern.
Furthermore, like the first embodiment, an HMM constituted
by the average vector of the input pattern, the variance,
etc. is stored in the reference pattern memory 4 as elements
representing phonemes.

The recognizing unit (or matching unit) 101 executes
recognition by checking which HMM is best matched to the
converted input pattern fed out from the converter 2. The
matching is executed by such a well-known method as the
Viterbi algorithm or the forward/backward algorithm shown in
Literature 4.

Fig. 4 is a flow chart for describing the process sequence
in the second embodiment of the present invention. Referring
to Figs. 3 and 4, the overall operation of the second
embodiment of the present invention will be described in
detail.

The analyzer 1 analyzes the input voice signal (step B101
in Fig. 4) and calculates the input pattern (step B102). The
converter 2 obtains the converted input pattern from the
input pattern fed out from the analyzer 1 by using the
conversion function of any one of Formulas (2) to (4) (step
B103). The value of α is "0" in the case of the first voice,
while the warping parameter values fed out from the
elongation/contraction estimating unit 3 are used as α in
the cases of the second and following voices. Then, the
recognizing unit 101 executes a recognizing process by using
the converted input pattern (step B104). At this time, the
HMM is inputted from the reference pattern memory 4 to the
recognizing unit 101 (step B106). Subsequent to the
recognizing process, the elongation/contraction estimating
unit 3 calculates the elongation/contraction parameter (step
B105). Thereafter, the process is repeated from the voice
input process in step B101 by using the
elongation/contraction parameter obtained in step B105.
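
The sequence of steps B101 to B106 can be sketched as the
loop below. This is only an outline: the five callables are
hypothetical stand-ins for the analyzer 1, the converter 2,
the recognizing unit 101 and the elongation/contraction
estimating unit 3, and their signatures are assumptions made
for the sketch.

```python
def recognize_stream(utterances, analyze, convert, recognize, estimate_alpha):
    """Sequential recognition with a running warping parameter.

    analyze(signal) -> input pattern
    convert(pattern, alpha) -> converted input pattern
    recognize(converted) -> recognition result
    estimate_alpha(pattern, result) -> new warping parameter
    All four signatures are assumptions of this sketch.
    """
    alpha = 0.0                                   # first voice: no warping
    results = []
    for signal in utterances:                     # step B101
        pattern = analyze(signal)                 # step B102
        converted = convert(pattern, alpha)       # step B103
        result = recognize(converted)             # steps B104 and B106
        results.append(result)
        alpha = estimate_alpha(pattern, result)   # step B105
    return results
```

The parameter estimated from one voice is applied to the next
voice, so the first voice is always converted with α = 0.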

The second embodiment has the following functional effect.
The second embodiment of the present invention comprises the
spectrum converter 100 of the first embodiment and the
recognizing unit 101. Thus, whenever a voice signal is
inputted, the value of the elongation/contraction parameter
is updated, and it is possible to correct frequency
deviation with respect to the reference pattern. The
recognition performance is thus improved.
In addition, in the second embodiment of the present
invention the elongation/contraction parameter estimation is
executed by using Formula (5), which maximizes the Q
function of the HMM maximum likelihood estimation. Thus, the
elongation/contraction parameter can be obtained as a
continuous value, and an improvement in recognition
performance can be expected compared to the case of using
preliminarily prepared discrete values.

A third embodiment of the present invention will now be
described. Fig. 5 is a view showing the construction of the
third embodiment of the present invention. Referring to Fig.
5, in the third embodiment the present invention is applied
to a pattern learning system, which comprises a learning
voice memory 201, a reference pattern estimating unit 202
and a likelihood judging unit 203 in addition to the
spectrum converter 100 in the first embodiment.

The learning voice memory 201 stores voice signals used for
learning the HMM. The reference pattern estimating unit 202
estimates HMM parameters by using the converted input
pattern fed out from the spectrum converter 100 and the HMM.
The estimation may be maximum likelihood estimation as
described in Literature 4. The likelihood judging unit 203
obtains distances corresponding to all the learning voice
signals by using the converted input pattern fed out from
the spectrum converter 100 and the HMM. Where the reference
patterns are HMMs, the distance is obtained by using such a
method as the Viterbi algorithm or the forward/backward
algorithm as described in Literature 5.

While the third embodiment of the present invention has
been described in connection with the learning of HMM, the
present invention is applicable to the learning of any
parameter concerning voice recognition.

Fig. 6 is a flow chart for describing the process in the
third embodiment of the present invention. The entire
operation of the third embodiment of the present invention
will now be described in detail with reference to Figs. 5
and 6. First, a learning voice signal is inputted to the
analyzer 1 in the spectrum converter 100 (step C101 in Fig.
6). The analyzer 1 analyzes the learning voice signal and
feeds out an input pattern (step C102). The
elongation/contraction estimating unit 3 estimates the
elongation/contraction parameter (step C103). The converter
2 executes input pattern conversion and feeds out a
converted input pattern (step C104). The reference pattern
estimating unit 202 executes HMM estimation by using the
converted input pattern and the HMM (step C105). The
likelihood judging unit 203 obtains the likelihood
corresponding to all the voice signals, and compares the
change in likelihood with a threshold (step C106). When the
change in likelihood is less than the threshold, the
reference pattern memory 4 is updated with the HMM estimated
in the reference pattern estimating unit 202, thus bringing
an end to the learning. When the change in likelihood is
greater than the threshold, the likelihood judging unit 203
updates the reference pattern memory 4 with the HMM
estimated by the reference pattern estimating unit 202, and
the sequence of processes is repeated from the learning
voice data input process (step C101).
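
The learning sequence of steps C101 to C106 can be outlined
as below. The callables, their signatures, the iteration cap
max_iters and the rule that the first pass never terminates
the loop (because no earlier likelihood exists to compare
with) are assumptions of the sketch and are not taken from
the specification.

```python
def train_reference_pattern(signals, hmm, analyze, estimate_alpha, convert,
                            reestimate, total_likelihood, threshold,
                            max_iters=50):
    """Iterate warping-parameter estimation and HMM re-estimation.

    Hypothetical signatures:
      analyze(signal) -> input pattern
      estimate_alpha(pattern, hmm) -> warping parameter
      convert(pattern, alpha) -> converted input pattern
      reestimate(converted_patterns, hmm) -> updated hmm
      total_likelihood(converted_patterns, hmm) -> float
    """
    previous = None
    for _ in range(max_iters):
        converted = []
        for signal in signals:                          # step C101
            pattern = analyze(signal)                   # step C102
            alpha = estimate_alpha(pattern, hmm)        # step C103
            converted.append(convert(pattern, alpha))   # step C104
        hmm = reestimate(converted, hmm)                # step C105 (memory update)
        likelihood = total_likelihood(converted, hmm)   # step C106
        if previous is not None and abs(likelihood - previous) < threshold:
            break                                       # change below threshold
        previous = likelihood
    return hmm
```

The warping parameter is re-estimated against the current
HMM on every pass, so the parameter and the reference
pattern are refined together.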

The third embodiment of the present invention has the
following effects. In the third embodiment of the present
invention, when learning a reference pattern obtained for
each speaker after correction of the effects of frequency
elongation and contraction with a warping function, the
elongation/contraction parameter estimation can be executed
during the learning process. Thus, it is possible to reduce
the computational effort compared to the prior art. In
addition, Formula (5) used for the elongation/contraction
parameter estimation is derived by using the maximum
likelihood of the HMM, and like other HMM parameter
estimation cases it can be readily adapted for use in the
course of learning.

A fourth embodiment of the present invention will now be
described. Fig. 7 is a view showing the construction of the
fourth embodiment of the present invention. Referring to
Fig. 7, the fourth embodiment of the present invention
comprises an inverse converter 5 in addition to the
construction of the first embodiment. The inverse converter
5 executes voice quality conversion by inversely converting
the elongated or contracted input pattern time series fed
out from the converter 2 and outputting a signal waveform in
the time domain.

A fifth embodiment of the present invention will now be
described. Fig. 8 is a view showing the construction of the
fifth embodiment of the present invention. In the fifth
embodiment of the present invention, the systems of the
above first to fourth embodiments are realized by program
control executed with a computer. Referring to Fig. 8, in
the case of realizing the processes in the analyzer 1, the
converter 2 and the elongation/contraction estimating unit 3
shown in Fig. 1 by executing a program on a computer 10, the
program is loaded from a recording medium 14, such as a
CD-ROM, DVD, FD, magnetic tape, etc., via a recording medium
accessing unit 13 into a main memory 12 of the computer 10,
and is executed in a CPU 11. In the recording medium 14 is
stored a program for executing, with the computer, an
analysis process for converting an input voice signal to an
input pattern including cepstrum, and an
elongation/contraction estimating process for outputting an
elongation/contraction parameter in the frequency axis
direction by using the input pattern and the reference
pattern stored in a reference pattern memory.

Alternatively, it is possible to record a program for
causing execution, with a computer, of a matching process of
computing the distance between the input pattern fed out
after elongation or contraction and each reference pattern,
and outputting the reference pattern corresponding to the
shortest distance as the result of recognition.

A program for causing execution, with the computer, of the
matching process of calculating the distance between the
input pattern after the elongation/contraction and the
reference pattern, and outputting the reference pattern
having the minimum distance as a recognition result, may be
recorded in the recording medium.

As a different alternative, it is possible to store in the
recording medium 14 a program for causing execution, with
the computer, of an analysis process for converting learning
voice data stored in a learning voice memory to an input
pattern containing cepstrum, an elongation/contraction
estimating process for outputting an elongation/contraction
parameter in the frequency axis direction by using the input
pattern and the reference pattern stored in a reference
pattern memory, a converting process for converting the
input pattern by using the elongation/contraction parameter,
a reference pattern estimating process for updating the
reference pattern with respect to the learning voice by
using the elongated or contracted input pattern fed out
after the conversion process and the reference patterns, and
a likelihood judging process of monitoring changes in
distance by computing the distance through utilization of
the elongated or contracted input pattern and the reference
patterns. It will be seen that in the second to fourth
embodiments it is possible to realize like program control.
It is also possible to download the program from a server
(not shown) via a network or like transfer medium. In other
words, any recording medium, including a communication
medium, may be used as the recording medium so long as it
can hold the program.
As has been described in the foregoing, according to the
present invention it is possible to obtain the following
advantages.

A first advantage is to reduce the computational effort
required for the calculation of the optimum parameter for
recognition performance in the voice signal spectrum
frequency elongation or contraction. This is so because the
present invention makes use of the fact that the conversion
by a first-order all-pass or like filter process with
respect to the frequency axis can be solved in the form of a
power series of the elongation/contraction parameter in the
cepstrum domain. Thus, when the series is approximated by a
first degree function, the elongation/contraction parameter
that maximizes the Q function for the maximum likelihood
estimation can be described by a function ready to be used
for calculation.

A second advantage is to make it possible to estimate the
elongation/contraction parameter simultaneously with other
parameters at the time of the HMM learning. This is so
because according to the present invention the function for
calculating the elongation/contraction parameter is derived
from the Q function for the maximum likelihood estimation in
voice recognition.

Changes in construction will occur to those skilled in the
art and various apparently different modifications and
embodiments may be made without departing from the scope of
the present invention. The matter set forth in the foregoing
description and accompanying drawings is offered by way of
illustration only. It is therefore intended that the
foregoing description be regarded as illustrative rather
than limiting.